📈 Time Series Modeling Workflow¶

1. Feature and Target Selection¶

  • Input Features:
    • Calendar-based variables: DayOfWeek, Day, Month, Year, WeekOfYear, DayOfYear
    • Promotional and holiday indicators: Promo, StateHoliday, SchoolHoliday
    • Historical sales statistics: Sales_Lag1, Sales_Lag7, Sales_Lag14, Sales_Rolling7, Sales_Rolling14
  • Target Variable: Sales

2. Chronological Train-Test Split¶

  • Training Set: First 80% of data
  • Test Set: Remaining 20% of data
  • Purpose: To respect the temporal order and prevent data leakage in time series modeling

3. Model Training¶

  • Model: XGBoost regressor
  • Parameters:
    • Number of trees: 100
    • Maximum depth: 5
    • Learning rate: 0.1
    • Loss function: reg:squarederror
  • Training Objective: To learn patterns that explain the variation in daily sales

4. Prediction¶

  • Generate sales predictions using the trained model on the test dataset

5. Model Evaluation¶

  • Metrics:
    • Root Mean Squared Error (RMSE): Measures the average magnitude of prediction errors, penalizing larger errors more heavily
    • Mean Absolute Error (MAE): Represents the average absolute difference between predicted and actual values
  • Purpose: To quantify the model's ability to generalize to new data
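For reference, with $y_i$ the actual and $\hat{y}_i$ the predicted sales over the $n$ test days, the two metrics are:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
$$

Because the errors are squared before averaging, RMSE is always at least as large as MAE; a wide gap between the two indicates a few large misses rather than uniformly small errors.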
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import timedelta
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

df = pd.read_csv('data/store_1.csv')
df["Date"] = pd.to_datetime(df["Date"])  # parse dates so .dt accessors and resample work

Understanding the Problem¶

Goal:¶

Predict future store sales from historical data. Key features include:

  • Day of the Week: Understanding weekly sales patterns.
  • Promotions: Analyzing the impact of promotions on sales.
  • Holidays: Considering the effect of holidays on sales.
  • Past Sales Behavior: Utilizing historical sales data to forecast future trends.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 928 entries, 0 to 927
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Store          928 non-null    int64 
 1   DayOfWeek      928 non-null    int64 
 2   Date           928 non-null    object
 3   Sales          928 non-null    int64 
 4   Customers      928 non-null    int64 
 5   Open           928 non-null    int64 
 6   Promo          928 non-null    int64 
 7   StateHoliday   928 non-null    int64 
 8   SchoolHoliday  928 non-null    int64 
 9   Id             928 non-null    int64 
dtypes: int64(9), object(1)
memory usage: 72.6+ KB
df.describe()

# Group by week and sum sales
weekly_sales = df.resample('W', on='Date')['Sales'].sum()

# Plot
plt.figure(figsize=(12, 6))
weekly_sales.plot(title='Weekly Sales', xlabel='Week', ylabel='Total Sales', marker='o')
plt.grid(True)
plt.show()

Preprocessing & Cleaning¶

  • Date Conversion:

    • Converted Date column to datetime format.
  • Feature Encoding:

    • Converted StateHoliday to numeric values, since XGBoost requires all input features to be numeric.
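The two conversions described above could look like the sketch below. The holiday code mapping ('0', 'a', 'b', 'c') is an assumption based on the common Rossmann-style encoding; in this particular export the column already loads as int64, in which case the mapping step is a no-op. The sample DataFrame is hypothetical.

import pandas as pd

# Hypothetical sample mimicking a raw export: Date as strings,
# StateHoliday with letter codes for the different holiday types.
df = pd.DataFrame({
    "Date": ["2015-07-29", "2015-07-30", "2015-07-31"],
    "StateHoliday": ["0", "a", "0"],
})

# Date conversion: parse strings into datetime64 so .dt accessors work
df["Date"] = pd.to_datetime(df["Date"])

# Feature encoding: map holiday codes to integers for XGBoost
holiday_map = {"0": 0, "a": 1, "b": 2, "c": 3}
df["StateHoliday"] = df["StateHoliday"].map(holiday_map).astype("int64")

print(df.dtypes)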
# Extract date-based features
df["Day"] = df["Date"].dt.day
df["Month"] = df["Date"].dt.month
df["Year"] = df["Date"].dt.year
df["WeekOfYear"] = df["Date"].dt.isocalendar().week.astype(int)  # cast from UInt32 so XGBoost accepts it
df["DayOfYear"] = df["Date"].dt.dayofyear

📊 Feature Engineering¶

Lag Features¶

  • Sales_Lag1: Captures the sales value from the previous day. This feature helps relate the sales on a given day to the immediate past performance.
  • Sales_Lag7: Captures the sales from the same day of the previous week, useful for identifying weekly patterns.
  • Sales_Lag14: Captures the sales from two weeks ago, aiding in capturing bi-weekly trends.

Rolling Mean Features¶

  • Sales_Rolling7: Computes the average sales over the past week (excluding the current day). This smooths out short-term fluctuations and highlights longer-term trends or cycles.
  • Sales_Rolling14: Computes the average sales over the past two weeks (excluding the current day), providing a broader view of ongoing trends.

Data Cleaning¶

  • After creating these features, rows with NaN values (which arise from the shift operations) are dropped to ensure a clean dataset suitable for model training.
  • The resulting df_model DataFrame contains no missing values and is ready for further analysis or to be used in machine learning models.

Overall, these transformations enrich the dataset by embedding historical sales information that is crucial for time series forecasting models, allowing them to learn from past patterns more effectively.

# Create lag features and rolling mean features
df["Sales_Lag1"] = df["Sales"].shift(1)
df["Sales_Lag7"] = df["Sales"].shift(7)
df["Sales_Lag14"] = df["Sales"].shift(14)

df["Sales_Rolling7"] = df["Sales"].shift(1).rolling(window=7).mean()
df["Sales_Rolling14"] = df["Sales"].shift(1).rolling(window=14).mean()

# Drop initial rows with NaN values from lag features
df_model = df.dropna().reset_index(drop=True)

df_model

XGBoost Regression Model Initialization and Training¶

Key Parameters:¶

  • n_estimators=100: Allows up to 100 trees.
  • max_depth=5: Trees can grow up to 5 levels deep.
  • learning_rate=0.1: Controls the contribution of each tree.
  • objective='reg:squarederror': Uses regression with squared error loss.

XGBoost is a widely used gradient-boosting library that performs well on tabular regression problems like this one.

# Select features and target
features = [
    "DayOfWeek", "Promo", "StateHoliday", "SchoolHoliday",
    "Sales_Lag1", "Sales_Lag7", "Sales_Lag14",
    "Sales_Rolling7", "Sales_Rolling14",
    "Day", "Month", "Year", "WeekOfYear", "DayOfYear"
]
target = "Sales"

X = df_model[features]
y = df_model[target]

# Split data: 80% train, 20% test (chronologically)
split_index = int(len(df_model) * 0.8)
X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]

# Train XGBoost model
model = XGBRegressor(n_estimators=100, max_depth=5, learning_rate=0.1, objective='reg:squarederror')
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluate
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)

rmse, mae
(526.9812321371766, 377.85596719335337)

Plotting actual vs. predicted sales¶

# Plot actual vs predicted sales
test_dates = df_model["Date"].iloc[len(X_train):].reset_index(drop=True)
plt.figure(figsize=(14, 6))
plt.plot(test_dates, y_test.values, label="Actual Sales", linewidth=2)
plt.plot(test_dates, y_pred, label="Predicted Sales", linewidth=2)
plt.title("Actual vs Predicted Sales (Test Set)")
plt.xlabel("Date")
plt.ylabel("Sales")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

🔮 Forecasting Future Sales¶

This step forecasts the next 30 days of sales recursively: each day's prediction is appended to the history so it can serve as a lag feature for the days that follow.

Process Overview:¶

  1. Initialize Forecasting:

    • Start from the Last Known Date: Determine the last available date in df_model and generate the next 30 days of future dates.
  2. Prepare Data for Forecasting:

    • History Storage: Create a copy of the existing data (df_model) to build lag features for predictions.
    • Initialize Predictions: Create an empty list to store future predictions.
  3. Simulate Daily Predictions:

    • For each date in the 30-day forecast:
      • Feature Engineering:
        • Date Features: Extract attributes like DayOfWeek, Day, Month, Year, WeekOfYear, and DayOfYear.
        • Promotion and Holiday Indicators: Simulate scenarios with "Promo," "StateHoliday," and "SchoolHoliday" indicators set to 0.
      • Lag and Rolling Features:
        • Generate lag features (e.g., Sales_Lag1, Sales_Lag7, Sales_Lag14) based on the historical data.
        • Calculate rolling averages (e.g., Sales_Rolling7, Sales_Rolling14) for a smoother trend analysis.
  4. Predict and Store Results:

    • Use the trained model (model) to predict sales for each day.
    • Collect each prediction and append it to the historical data for future computations.
  5. Compile the Forecast:

    • Compile all daily predictions into a forecast_df DataFrame to organize the future sales and dates effectively.

This structured approach enriches the forecast by incorporating both temporal features and historical patterns, making it well suited to recursive time series prediction.

# Start from the last known date
last_date = df_model["Date"].max()
future_dates = pd.date_range(start=last_date + timedelta(days=1), periods=30)

# Store last available history to build lag features
history = df_model.copy()

# Prepare empty frame for forecasts
future_predictions = []

# Simulate 30 days ahead, one day at a time
for future_date in future_dates:
    # Build new row
    row = {
        "Date": future_date,
        "DayOfWeek": future_date.weekday() + 1,  # +1 to match existing format
        "Promo": 0,  # You can simulate different scenarios
        "StateHoliday": 0,
        "SchoolHoliday": 0,
        "Day": future_date.day,
        "Month": future_date.month,
        "Year": future_date.year,
        "WeekOfYear": future_date.isocalendar().week,
        "DayOfYear": future_date.timetuple().tm_yday
    }

    # Compute lag/rolling from history
    for lag in [1, 7, 14]:
        row[f"Sales_Lag{lag}"] = history["Sales"].iloc[-lag]

    row["Sales_Rolling7"] = history["Sales"].iloc[-7:].mean()
    row["Sales_Rolling14"] = history["Sales"].iloc[-14:].mean()

    # Convert to DataFrame and predict
    row_df = pd.DataFrame([row])
    X_future = row_df[features]
    row["Sales"] = model.predict(X_future)[0]

    # Save prediction and add to history
    future_predictions.append(row)
    history = pd.concat([history, pd.DataFrame([row])], ignore_index=True)

# Create forecast DataFrame
forecast_df = pd.DataFrame(future_predictions)
forecast_df = forecast_df[["Date", "Sales"]]

forecast_df

Sales forecast plot¶

plt.figure(figsize=(12, 5))
plt.plot(forecast_df["Date"], forecast_df["Sales"], label="Forecasted Sales", color="blue")
plt.title("Simulated 30-Day Sales Forecast")
plt.xlabel("Date")
plt.ylabel("Sales")
plt.xticks(rotation=45)
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()